According to the documentation (https://search.r-project.org/CRAN/refmans/spData/html/boston.html), this dataset contains housing data that was collected as part of the 1970 census of Boston, Massachusetts. The corrected data from Harrison and Rubinfeld (1978) are contained in a data frame comprising 506 rows and 20 columns. Each observation (row) in the dataset contains a collection of statistics corresponding to a single census ‘tract’ (a small geographic region containing multiple houses, defined specifically for a census). Note that MEDV is censored: median values at or over USD 50,000 are set to USD 50,000.
In this project we will consider the spatial distribution of the CMEDV variable. This variable corresponds to the median value (in thousands of USD) of owner-occupied housing in each census tract. Each tract is also associated with a point location; geographic coordinates for this point (measured in decimal degrees of latitude and longitude), as well as the town in which it is located (within the Greater Boston area), are provided for each observation.
We are going to derive a smaller dataframe from the above data set that contains only the variables TOWN, LON, LAT and CMEDV:
| TOWN | LON | LAT | CMEDV |
|---|---|---|---|
| Nahant | -70.96 | 42.26 | 24.0 |
| Swampscott | -70.95 | 42.29 | 21.6 |
| Swampscott | -70.94 | 42.28 | 34.7 |
| Marblehead | -70.93 | 42.29 | 33.4 |
| Marblehead | -70.92 | 42.30 | 36.2 |

| Variable | x |
|---|---|
| TOWN | 0 |
| LON | 0 |
| LAT | 0 |
| CMEDV | 0 |
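The subsetting step can be sketched as follows. This is a minimal illustration in base R, not the code used for the report: it rebuilds the five rows shown above as a toy data frame, adds a placeholder column `OTHER` (hypothetical, present only so that there is something to drop), and counts missing values per column, which is one plausible reading of the second table above.

```r
# Toy stand-in for the full data: the five rows shown above plus one
# placeholder column (OTHER) that exists only to be dropped.
boston_full <- data.frame(
  TOWN  = c("Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead"),
  LON   = c(-70.96, -70.95, -70.94, -70.93, -70.92),
  LAT   = c(42.26, 42.29, 42.28, 42.29, 42.30),
  CMEDV = c(24.0, 21.6, 34.7, 33.4, 36.2),
  OTHER = c(1, 2, 3, 4, 5),  # hypothetical extra variable
  stringsAsFactors = FALSE
)

# Keep only the four variables of interest
boston_small <- boston_full[, c("TOWN", "LON", "LAT", "CMEDV")]

# Count missing values in each retained column
na_counts <- colSums(is.na(boston_small))
print(na_counts)
```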
Coordinates
Next, to make the visualisation process easier, we include a map. In the figure below, we can see that the points representing the latitudes and longitudes do not match the towns on the map. In the second map we can even observe that some towns appear to be on the water.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Coordinates on map
## Assuming "lon" and "lat" are longitude and latitude, respectively
Coordinates on map
The third map shows the right and wrong coordinates for Cambridge.
## Assuming "LON" and "LAT" are longitude and latitude, respectively
Zoom on map
In order to correct the data, we suppose that all coordinates within a town are shifted by the same amount. We assume that there are \(n_j\) observations in town \(j\), and for each observation \(k\) in town \(j\), we denote the longitudinal coordinate as \(x_{j,k}, k = 1,\dots, n_j\). Then we assume:
\[ x_{j,k}=TC^{(x)}_j+\Delta^{(x)}_{j,k}\] where \(TC^{(x)}_j\) is the longitudinal coordinate of the center of town \(j\), and \(\Delta^{(x)}_{j,k}\) is the displacement of observation \(k\) in town \(j\) from the town center. We also assume that the latitudinal coordinates (which we denote \(y_{j,k}\)) satisfy a similar relationship. The suggested systematic error is therefore that \((TC^{(x)}_j ,TC^{(y)}_j)\) has been misspecified for \(j = 1, \dots, n\), where \(n\) is the number of towns.
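Written out in the same notation, the latitudinal relationship referred to above would read as follows (this simply replaces \(x\) with \(y\) and is stated here for completeness):
\[ y_{j,k}=TC^{(y)}_j+\Delta^{(y)}_{j,k}, \qquad k = 1,\dots,n_j \]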
To find the displacement, we are going to use the correct center coordinates for each town, which are provided in the file BostonTownCentres.csv. First we are going to have a quick look at the data.
Note: We can see that the towns in this instance are of type character.
## Rows: 92 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): town
## dbl (2): lat, lon
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| town | lat | lon |
|---|---|---|
| Arlington | 42.41537 | -71.15644 |
| Ashland | 42.26066 | -71.46413 |
| Bedford | 42.49173 | -71.28179 |
| Belmont | 42.39593 | -71.17867 |
| Beverly | 42.55843 | -70.88005 |
Next we’re using an appropriate mutating join to combine the two data sets. We check and observe that the number of rows in boston.c doesn’t match the number of rows in the new data frame. We find that the missing data corresponds to Saugus, which is spelled as Sargus in boston.c. As a result, we correct the instances of Sargus and join the corrected data frame with BostonTownCentres. This time the row counts match.
#Join data frames
join.coord<-centre.coord %>% left_join(BostonData, by=c('town'='TOWN'))
#Check number of rows match
nrow(join.coord)==nrow(BostonData)
## [1] FALSE
##Find the town that's missing
setdiff(unique(BostonData$TOWN), unique(join.coord$town))
## [1] "Sargus"
#Empty dataframe to avoid duplicates
join.coord<-NA
##Correct missing values
BostonData$TOWN[BostonData$TOWN=='Sargus']<-'Saugus'
#Join correct data frames
join.coord<-centre.coord %>% left_join(BostonData, by=c('town'='TOWN'))
nrow(join.coord)==nrow(BostonData)
## [1] TRUE
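The same diagnose-and-fix pattern can be sketched in base R on toy data. This is only an illustration under stated assumptions: the data frames `obs` and `centres`, and all values in them, are hypothetical placeholders rather than the real boston.c or BostonTownCentres.csv contents. Here the join starts from the observations, so an unmatched town would surface as `NA` coordinates instead of being dropped.

```r
# Hypothetical observations: two towns, one of them misspelled
obs <- data.frame(
  TOWN  = c("Nahant", "Sargus", "Sargus"),
  CMEDV = c(24.0, 20.0, 22.0),  # placeholder values
  stringsAsFactors = FALSE
)

# Hypothetical reference table of town centres (placeholder coordinates)
centres <- data.frame(
  town = c("Nahant", "Saugus"),
  lat  = c(42.4, 42.5),
  lon  = c(-70.9, -71.0),
  stringsAsFactors = FALSE
)

# Towns in the observations that have no matching centre
unmatched <- setdiff(unique(obs$TOWN), centres$town)
print(unmatched)

# Fix the spelling, then join; all.x = TRUE keeps every observation
obs$TOWN[obs$TOWN == "Sargus"] <- "Saugus"
joined <- merge(obs, centres, by.x = "TOWN", by.y = "town", all.x = TRUE)

# Every observation should now have a centre, and no row should be lost
stopifnot(nrow(joined) == nrow(obs), !anyNA(joined$lat))
```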
Next we’re going to visualize the correct coordinates. We can already observe that there are no points on water and that they seem to match the towns on the map.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Correct coordinates on map
We’re going to zoom into an area to check if everything is in order.
## Assuming "lon" and "lat" are longitude and latitude, respectively
Zoom on correct coordinates on map
To fix our data set, we need to replace the centroid of the \(n_j\) boston.c locations in each town (i.e. for \(j = 1,\dots,n\)) with the true town center. First, we find the centroid in our dataset by grouping the data by town and taking the mean longitude and latitude. Then we calculate the displacement as follows: \[x_{j,k}=TC^{(x)}_j+\Delta^{(x)}_{j,k} \Rightarrow \Delta^{(x)}_{j,k}=x_{j,k}-TC^{(x)}_j\] In the equation above, \(x_{j,k}\) is known (it is the coordinate recorded in boston.c) and \(TC^{(x)}_j\) is the misspecified town centroid, calculated as the mean longitude of the town's observations; the latitudinal displacement is obtained in the same way. We then add the displacement of each observation to the true town center contained in BostonTownCentres.csv and create a new data frame containing two columns with the corrected coordinates for each observation. Finally, we append these columns to the combined data frame above.
#Calculate the centroid in old data set
centroid<-BostonData %>% group_by(TOWN) %>% summarise(centre_lon=mean(LON),centre_lat=mean(LAT))
#Empty data frame to collect the corrected lon-lat
new_cord<-data.frame(cor_lon=double(),cor_lat=double())
##Loop through all names in centroid
for (name in centroid$TOWN){
#Create a temporary data frame from our data containing the lon and lats of the town equal to name
temp<-BostonData %>% filter(TOWN==name)
#Create temporary data frames containing the wrong and correct centroids of the town equal to name
temp.centre<-centroid %>% filter(TOWN==name)
cor.centroid<-centre.coord %>% filter(town==name)
#Calculate displacement for both lon-lat
dislon<-temp$LON-temp.centre$centre_lon
dislat<-temp$LAT-temp.centre$centre_lat
#Calculate the right coordinates
cor_lon<-cor.centroid$lon+dislon
cor_lat<-dislat+cor.centroid$lat
#Add the right coordinates to our new dataframe
new_cord<-rbind(new_cord, cbind(cor_lon,cor_lat))
}
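#Combine the new data frame
#Note: cbind() pairs rows by position, so join.coord and new_cord must list the towns in the same order
join.coord<-cbind(join.coord,new_cord)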
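The same correction can also be written without a loop. The sketch below is an alternative formulation in base R, not the code used above, and runs on hypothetical toy data (`obs`, `centres` and all values are placeholders). It looks up each town's true centre by name with `match()`, so the result does not depend on the row order of either table.

```r
# Hypothetical misplaced observations for two towns
obs <- data.frame(
  TOWN = c("A", "A", "B", "B", "B"),
  LON  = c(-71.00, -70.98, -70.90, -70.88, -70.86),
  LAT  = c(42.20, 42.22, 42.30, 42.31, 42.32),
  stringsAsFactors = FALSE
)

# Hypothetical true town centres, deliberately in a different order
centres <- data.frame(
  town = c("B", "A"),
  lon  = c(-71.10, -71.20),
  lat  = c(42.40, 42.35),
  stringsAsFactors = FALSE
)

# Displacement of each observation from its (misplaced) town centroid;
# ave() returns the per-town mean repeated for every row of that town
dis_lon <- obs$LON - ave(obs$LON, obs$TOWN)
dis_lat <- obs$LAT - ave(obs$LAT, obs$TOWN)

# Look up the true centre of each observation's town by name
idx <- match(obs$TOWN, centres$town)

# Corrected coordinate = true centre + displacement
obs$cor_lon <- centres$lon[idx] + dis_lon
obs$cor_lat <- centres$lat[idx] + dis_lat

# The per-town mean of the corrected coordinates recovers the true centres
print(tapply(obs$cor_lon, obs$TOWN, mean))
print(tapply(obs$cor_lat, obs$TOWN, mean))
```

Because the displacements within a town sum to zero, the centroid of the corrected points coincides with the true town centre, which gives a quick sanity check on the result.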
Final maps
Finally, we construct a visualisation that shows the spatial distribution of the median value of owner-occupied housing in Greater Boston in 1970. In this instance, we are going to use ggmap. We observe that some towns have only one observation, so we can’t create polygons for them.
## Source : https://maps.googleapis.com/maps/api/staticmap?center=42.36008,-71.05888&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx-0NQyKizPR9jdAYCfTiyB5IhVfbdU2xI
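The check for towns with too few observations to form a polygon can be sketched as follows. This is a minimal illustration in base R that uses only the town names from the five-row excerpt shown earlier, so the counts refer to that excerpt and not to the full data set.

```r
# Towns from the five-row excerpt shown earlier (not the full data set)
towns <- c("Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead")

# Number of observations per town
n_obs <- table(towns)

# A polygon needs at least three vertices, so towns with fewer points
# can only be drawn as points (or as a line segment)
too_few <- names(n_obs)[n_obs < 3]

# Towns with exactly one observation
single <- names(n_obs)[n_obs == 1]
print(single)
```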